This HTML notebook replicates the Stata do-file vnm_mics14_dp2019 in R. The goal is an R script that does what that do-file does.
I will follow the same section numbering as the do-file for ease of comparison.
Before we start the exercise, here are the packages whose functions we will use.
library(here)
## here() starts at D:/R projects/OPHI/Vietnam Translation/Vietnam-MICS-MPI-to-R
library(tidyverse)
## -- Attaching packages --------
## v ggplot2 3.3.0 v purrr 0.3.4
## v tibble 3.0.1 v dplyr 0.8.5
## v tidyr 1.0.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## Warning: package 'ggplot2' was built under R version 3.6.3
## Warning: package 'tibble' was built under R version 3.6.3
## Warning: package 'purrr' was built under R version 3.6.3
## Warning: package 'dplyr' was built under R version 3.6.3
## Warning: package 'forcats' was built under R version 3.6.3
## -- Conflicts -----------------
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(janitor)
## Warning: package 'janitor' was built under R version 3.6.3
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(haven)
library(gt)
## Warning: package 'gt' was built under R version 3.6.3
library(readr)
The memory- and environment-clearing commands and the commands for setting working folders and paths are not needed when working in Project mode in RStudio with the {here} package (the braces are a common way of writing an R package name). {here} manages paths and directories relative to the project root.
__ Selecting main variables from CH, WM, HH & MN recode & merging with HL recode __
It should be noted that anthropometric data was not collected for children under 5 as part of the Viet Nam MICS 2014 dataset. Previously, nutrition data was collected as part of Viet Nam MICS 2011. However, the data was not collected in this round due to time and resource constraints as well as the availability of national nutrition survey data (p.61)
The comments above are copied from the Stata do-file.
According to the Stata do-file, there is no data for this section.
The purpose of step 1.2 is to identify children of any age who died in the 5 years before the survey date, as in the Stata do-file.
Loading the data from bh.sav.
bh_dat <- read_sav(file = here("Viet Nam_MICS5_Datasets",
                               "Viet Nam MICS 2013-14 SPSS Datasets",
                               "bh.sav"))
bh_dat <- clean_names(bh_dat)
The code chunk above loads bh.sav into a data object named bh_dat. The clean_names() function converts all variable names to lower snake case. In case anyone is wondering what the possible cases are, please refer to the wonderful art by Allison Horst shown below.
Various Cases (Let me know your favourite)
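To make "lower snake case" concrete, here is a rough base R sketch of the idea. The helper to_snake() and the example names are hypothetical, made up for illustration; this is not how janitor implements clean_names(), which handles many more edge cases.

```r
# Hypothetical helper, for illustration only (not janitor's implementation)
to_snake <- function(x) {
  x <- gsub("([a-z])([A-Z])", "\\1_\\2", x)  # split camelCase at lower-to-upper boundaries
  x <- gsub("[^A-Za-z0-9]+", "_", x)         # collapse spaces and punctuation into "_"
  x <- gsub("^_+|_+$", "", x)                # trim leading and trailing underscores
  tolower(x)
}

# Made-up examples of upper-case, camel-case and spaced names
snake_names <- to_snake(c("HH1", "BH4C", "wmWeight", "Birth Int"))
print(snake_names)
```

In the notebook itself, clean_names(bh_dat) does this for every column at once.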
Now let us take a glimpse at the data and the names of the variables.
names(bh_dat)
## [1] "hh1" "hh2" "ln" "bhln" "bh2" "bh3"
## [7] "bh4m" "bh4y" "bh5" "bh6" "bh7" "bh8"
## [13] "bh9u" "bh9n" "bh10" "bh4c" "bh4f" "bh9c"
## [19] "bh9f" "hh6" "hh7" "wdoi" "wdob" "ethnicity"
## [25] "welevel" "brthord" "magebrt" "birthint" "wmweight" "wscore"
## [31] "windex5" "wscoreu" "windex5u" "wscorer" "windex5r" "windex2"
head(bh_dat) %>% gt()
| hh1 | hh2 | ln | bhln | bh2 | bh3 | bh4m | bh4y | bh5 | bh6 | bh7 | bh8 | bh9u | bh9n | bh10 | bh4c | bh4f | bh9c | bh9f | hh6 | hh7 | wdoi | wdob | ethnicity | welevel | brthord | magebrt | birthint | wmweight | wscore | windex5 | wscoreu | windex5u | wscorer | windex5r | windex2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 2 | 1 | 1 | 1 | 4 | 2011 | 1 | 2 | 1 | 4 | NA | NA | NA | 1336 | 1 | NA | NA | 1 | 1 | 1368 | 1044 | 1 | 4 | 1 | 2 | 1 | 2.167035 | 1.640778 | 5 | 1.337577 | 5 | NA | NA | 2 |
| 1 | 2 | 2 | 2 | 1 | 1 | 5 | 2013 | 1 | 0 | 1 | 5 | NA | NA | 2 | 1361 | 1 | NA | NA | 1 | 1 | 1368 | 1044 | 1 | 4 | 2 | 2 | 2 | 2.167035 | 1.640778 | 5 | 1.337577 | 5 | NA | NA | 2 |
| 1 | 3 | 2 | 1 | 1 | 2 | 3 | 2003 | 1 | 10 | 1 | 4 | NA | NA | NA | 1239 | 1 | NA | NA | 1 | 1 | 1368 | 939 | 1 | 3 | 1 | 2 | 1 | 2.167035 | 1.427797 | 5 | 1.071892 | 5 | NA | NA | 2 |
| 1 | 3 | 2 | 2 | 1 | 1 | 6 | 2007 | 1 | 6 | 1 | 5 | NA | NA | 2 | 1290 | 1 | NA | NA | 1 | 1 | 1368 | 939 | 1 | 3 | 2 | 2 | 4 | 2.167035 | 1.427797 | 5 | 1.071892 | 5 | NA | NA | 2 |
| 1 | 4 | 2 | 1 | 1 | 2 | 11 | 2007 | 1 | 6 | 1 | 4 | NA | NA | NA | 1295 | 1 | NA | NA | 1 | 1 | 1368 | 1008 | 1 | 4 | 1 | 2 | 1 | 2.167035 | 1.613568 | 5 | 1.303632 | 5 | NA | NA | 2 |
| 1 | 6 | 2 | 1 | 1 | 1 | 3 | 1995 | 1 | 18 | 1 | 3 | NA | NA | NA | 1143 | 1 | NA | NA | 1 | 1 | 1368 | 788 | 1 | 4 | 1 | 2 | 1 | 2.167035 | 1.624926 | 5 | 1.317802 | 5 | NA | NA | 2 |
Now let us create the variables that are not grouped, i.e. those generated without bysort in Stata. Grouped variables are created differently in R; that is shown in the code chunk following this one.
bh_dat <- bh_dat %>%
  mutate(
    # Woman identifier built from cluster, household and line number
    ind_id = structure((hh1 * 100000) + (hh2 * 100) + ln,
                       label = "Individual ID"),
    # Date of birth (CMC) plus age at death in months
    date_death = structure(bh4c + bh9c,
                           label = "Date of Death CMC"),
    # Missing when the child is alive and has no age at death
    mdead_survey = structure(ifelse((bh9c == 0 | is.na(bh9c)) & bh5 == 1,
                                    NA,
                                    wdoi - date_death),
                             label = "Months Dead from Survey"),
    ydead_survey = structure(mdead_survey / 12,
                             label = "Years dead from Survey"),
    age_death = structure(ifelse(bh5 == 2, bh9c, NA),
                          label = "Age at death in Months"),
    child_died = forcats::as_factor(
      ifelse(is.na(bh5), NA,
             ifelse(bh5 == 1, "child is alive", "child is dead"))),
    # Same as child_died, except that a death at 216 months (18 years)
    # or older is not counted as a child death
    child18_died = forcats::as_factor(
      ifelse(is.na(bh5), NA,
             ifelse(bh5 == 2 & (is.na(age_death) | age_death < 216),
                    "child is dead",
                    "child is alive")))
  )
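As a quick check on the date arithmetic, here is a small worked example in base R. It assumes the usual century month code (CMC) convention, in which months are counted from January 1900. The birth date and wdoi come from the first row of the table above; the age at death is hypothetical, since that child is recorded as alive (bh5 = 1). The helper cmc() is likewise only for illustration.

```r
# Century month code: months counted from January 1900 (January 1900 = 1)
cmc <- function(year, month) (year - 1900) * 12 + month

# First row of bh_dat above: born April 2011 (bh4m = 4, bh4y = 2011)
bh4c <- cmc(2011, 4)   # 1336, which matches the bh4c column
wdoi <- 1368           # interview date (CMC) from the same row

# HYPOTHETICAL age at death in months, purely to show the arithmetic
bh9c <- 10

date_death   <- bh4c + bh9c         # date of death in CMC: 1346
mdead_survey <- wdoi - date_death   # months between death and interview: 22
ydead_survey <- mdead_survey / 12   # the same gap in years, about 1.83

# Individual ID for hh1 = 1, hh2 = 2, ln = 2
ind_id <- (1 * 100000) + (2 * 100) + 2   # 100202
```

With these numbers the death falls within 5 years of the interview, so it would count towards tot_child18_died_5y.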
Creating grouped variables.
Notice that I need not write equivalent R code for the following Stata code chunks:
1> This has been included in the grouped condition for the variable.
replace tot_child18_died_5y=. if child18_died==1 & ydead_survey==. //Replace as '.' if there is no information on when the child died
2> This need not be written because the structure of the data is different: only one row exists per ind_id. Note that, because of this, there is no need to create "childu18_died_per_wom_5y", as it is the same as "tot_child18_died_5y".
bysort ind_id: egen childu18_died_per_wom_5y = max(tot_child18_died_5y)
lab var childu18_died_per_wom_5y "Total child under 18 death for each women in the last 5 years (birth recode)"
3> This need not be written because the structure of the data is different: only one row exists per ind_id.
//Keep one observation per women
bysort ind_id: gen id=1 if _n==1
keep if id==1
drop id
duplicates report ind_id
grouped_bh_dat <- bh_dat %>%
  group_by(ind_id) %>%
  summarise(
    tot_child_died = sum(child_died == "child is dead", na.rm = TRUE),
    tot_child18_died_5y = sum(
      child18_died[ydead_survey <= 5 & !is.na(ydead_survey)] == "child is dead",
      na.rm = TRUE),
    women_BH = 1
  ) %>%
  mutate(
    tot_child18_died_5y = ifelse(
      is.na(tot_child18_died_5y) & tot_child_died >= 0 & !is.na(tot_child_died),
      0,
      tot_child18_died_5y)
  )
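To see why group_by() followed by summarise() makes the Stata "keep one observation per woman" step unnecessary, here is a minimal sketch on made-up data. The toy data frame and all of its values are assumptions for illustration, not survey records.

```r
library(dplyr)

# Made-up birth histories: three women, five births
toy <- data.frame(
  ind_id       = c(100202, 100202, 100302, 100302, 100402),
  child18_died = c("child is alive", "child is dead",
                   "child is dead", "child is dead", "child is alive"),
  ydead_survey = c(NA, 1.5, 7, 2, NA),
  stringsAsFactors = FALSE
)

toy_grouped <- toy %>%
  group_by(ind_id) %>%
  summarise(
    # Count deaths with a known date that fall within 5 years of the survey
    tot_child18_died_5y = sum(
      child18_died[ydead_survey <= 5 & !is.na(ydead_survey)] == "child is dead",
      na.rm = TRUE)
  )

print(toy_grouped)
anyDuplicated(toy_grouped$ind_id)  # 0 means no ind_id appears twice
```

summarise() returns one row per ind_id, so the result is already at the woman level. For the woman with no qualifying death, the sum over an empty selection is 0 rather than NA in this toy data.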
Writing new files.
write_csv(bh_dat, here("output data files","1.2Intermediate_VNM14_BH.csv"))
write_csv(grouped_bh_dat,here("output data files","1.2VNM14_BH.csv")) # same as VNM14_BH.dta